Refactoring to use CosmosError instead of azure_core::Error #4442
Merged
analogrelay merged 138 commits into … on May 28, 2026
…to users/fabianm/ResponseHeadersErrors
Pull request overview
This PR refactors the Cosmos SDK + driver to use a first-class CosmosError/Result<T> instead of azure_core::Error, and introduces rate-limited backtrace capture on Cosmos errors for improved diagnostics.
Changes:
- Introduced azure_data_cosmos::CosmosError + azure_data_cosmos::Result<T> (and driver equivalents), migrating public APIs/tests/examples to the new error surface (status_code(), sub_status(), typed headers, predicates).
- Added driver-side backtrace capture with rate limiting and lazy symbol resolution + caching, configurable via runtime builder and environment variable.
- Refactored parts of the driver pipeline plumbing (e.g., PipelineContext) and HTTP error construction to flow typed Cosmos status/headers through the new error type.
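To make the "predicates instead of status code matching" migration concrete, here is a minimal sketch. The `CosmosError` type, its constructor, and `read_optional` below are hypothetical stand-ins written for illustration; only the method names `status_code()`, `sub_status()`, `message()` and `is_not_found()` come from this overview, and the real signatures in `azure_data_cosmos` may differ.

```rust
use std::fmt;

/// Hypothetical stand-in for the SDK error type (not the real definition).
#[derive(Debug)]
pub struct CosmosError {
    status_code: u16,
    sub_status: Option<u32>,
    message: String,
}

impl CosmosError {
    /// Illustrative constructor; the real crate builds errors differently.
    pub fn new(status_code: u16, sub_status: Option<u32>, message: &str) -> Self {
        Self { status_code, sub_status, message: message.to_string() }
    }

    pub fn status_code(&self) -> u16 {
        self.status_code
    }

    pub fn sub_status(&self) -> Option<u32> {
        self.sub_status
    }

    pub fn message(&self) -> &str {
        &self.message
    }

    /// Predicates replace ad-hoc comparisons at call sites.
    pub fn is_not_found(&self) -> bool {
        self.status_code == 404
    }

    pub fn is_conflict(&self) -> bool {
        self.status_code == 409
    }

    pub fn is_precondition_failed(&self) -> bool {
        self.status_code == 412
    }
}

impl fmt::Display for CosmosError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.message)
    }
}

impl std::error::Error for CosmosError {}

/// Crate-wide alias in the style the overview describes.
pub type Result<T> = std::result::Result<T, CosmosError>;

/// Treats "not found" as an absent item and forwards every other failure.
pub fn read_optional(outcome: Result<String>) -> Result<Option<String>> {
    match outcome {
        Ok(item) => Ok(Some(item)),
        Err(e) if e.is_not_found() => Ok(None),
        Err(e) => Err(e),
    }
}
```

With this shape a caller never inspects raw status numbers: `read_optional` turns a 404 into `Ok(None)` and leaves a 409 or 412 as an error to be checked with the matching predicate.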
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos/tests/multi_write_tests/cosmos_multi_write_retry_policies.rs | Updates status assertions to CosmosError::status_code(). |
| sdk/cosmos/azure_data_cosmos/tests/multi_write_tests/cosmos_multi_write_fault_injection.rs | Updates injected-error assertions to status_code(). |
| sdk/cosmos/azure_data_cosmos/tests/in_memory_emulator_tests/end_to_end.rs | Switches helper functions and retry logic to CosmosError APIs (status_code(), sub_status()). |
| sdk/cosmos/azure_data_cosmos/tests/framework/test_data.rs | Migrates helper result type and 429/409 matching to new error API. |
| sdk/cosmos/azure_data_cosmos/tests/framework/test_client.rs | Migrates test framework helpers to azure_data_cosmos::Result and status_code(). |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_response_metadata.rs | Migrates header checks from raw headers to typed ResponseHeaders from CosmosError. |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_query.rs | Uses typed status/substatus access instead of raw header parsing. |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_patch.rs | Migrates create container + status assertions to new result/error APIs. |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_items.rs | Migrates multiple item CRUD status assertions to status_code(). |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_fault_injection.rs | Migrates fault-injection status assertions to status_code(). |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_batch.rs | Migrates batch error status assertions to status_code(). |
| sdk/cosmos/azure_data_cosmos/src/session_helpers.rs | Switches internal helpers to crate::Result and CosmosError constructors. |
| sdk/cosmos/azure_data_cosmos/src/query/mod.rs | Query::with_parameter now returns crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/query/executor.rs | Migrates executor APIs to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/models/response_headers.rs | Adds #[repr(transparent)] + driver-ref borrowing helper for typed headers. |
| sdk/cosmos/azure_data_cosmos/src/models/response_body.rs | Maps driver body conversion errors into CosmosError via crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/models/resource_response.rs | Migrates into_model() to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/models/item_response.rs | Migrates into_model() to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/models/cosmos_response.rs | Migrates internal model conversion to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/models/batch_response.rs | Migrates batch model conversion to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/lib.rs | Exposes new error module and re-exports CosmosError, CosmosErrorKind, Result. |
| sdk/cosmos/azure_data_cosmos/src/feed.rs | Migrates feed iterators/pages to crate::Result and CosmosError. |
| sdk/cosmos/azure_data_cosmos/src/feed_range.rs | Migrates parsing/validation to CosmosError + crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/error.rs | Adds SDK-owned CosmosError wrapper + accessors + conversions. |
| sdk/cosmos/azure_data_cosmos/src/connection_string.rs | Migrates parsing errors to CosmosError and updates tests to check .message(). |
| sdk/cosmos/azure_data_cosmos/src/clients/throughput_poller.rs | Migrates poller stream/future outputs to crate::Result and CosmosError. |
| sdk/cosmos/azure_data_cosmos/src/clients/offers_client.rs | Migrates offer helpers to crate::Result + CosmosError construction. |
| sdk/cosmos/azure_data_cosmos/src/clients/database_client.rs | Migrates database client APIs/doc examples to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client.rs | Migrates cosmos client APIs to crate::Result. |
| sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs | Migrates builder build to crate::Result and rewraps errors as CosmosError. |
| sdk/cosmos/azure_data_cosmos/src/clients/container_client.rs | Migrates container client public APIs to crate::Result and CosmosError constructors. |
| sdk/cosmos/azure_data_cosmos/src/account_endpoint.rs | Migrates endpoint parsing to CosmosError. |
| sdk/cosmos/azure_data_cosmos/examples/cosmos/replace.rs | Uses CosmosError predicate (is_not_found) instead of status code matching. |
| sdk/cosmos/azure_data_cosmos/examples/cosmos/read.rs | Uses CosmosError predicate (is_not_found) instead of status code matching. |
| sdk/cosmos/azure_data_cosmos/examples/cosmos/delete.rs | Uses CosmosError predicate (is_not_found) instead of status code matching. |
| sdk/cosmos/azure_data_cosmos/CHANGELOG.md | Documents new error type/result alias and backtrace capture behavior. |
| sdk/cosmos/azure_data_cosmos_driver/src/models/cosmos_status.rs | Adjusts/extends CosmosStatus predicates (is_not_found, adds conflict/412 helpers). |
| sdk/cosmos/azure_data_cosmos_driver/src/lib.rs | Exposes new error module and re-exports driver CosmosError/CosmosErrorKind. |
| sdk/cosmos/azure_data_cosmos_driver/src/error/mod.rs | Introduces driver CosmosError type, conversions to/from azure_core::Error, and Result<T>. |
| sdk/cosmos/azure_data_cosmos_driver/src/error/backtrace.rs | Adds rate-limited, lazily-resolved backtrace capture with global symbol cache. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/runtime.rs | Adds runtime builder configuration hooks for backtrace limiter policy. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/retry_evaluation.rs | Builds typed CosmosError for service/transport/deadline outcomes and adjusts retry plumbing. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/patch_handler.rs | Updates docs/comments referencing new HTTP-error construction path. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/operation_pipeline.rs | Refactors pipeline invocation to use a PipelineContext aggregation struct. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs | Constructs and passes new PipelineContext into the operation pipeline. |
| sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md | Documents new driver error type/result alias and backtrace capture feature. |
| sdk/cosmos/azure_data_cosmos_driver/Cargo.toml | Adds backtrace dependency. |
| Cargo.toml | Adds workspace dependency backtrace = "0.3". |
| Cargo.lock | Locks new transitive dependencies from backtrace. |
Copilot's findings
- Files reviewed: 48/49 changed files
- Comments generated: 6
heaths
requested changes
May 20, 2026
- From<azure_core::Error> now propagates the HttpResponse error_code into CosmosStatus::sub_status so is_partition_topology_change classifies wrapped 410s correctly.
- exhaustion_error preserves the caller-facing context message via with_context after the azure_core round-trip.
- session_token_from_error walks the std::error::Error source chain to recover the raw_response when the cosmos error was minted via From<azure_core::Error>.
- PATCH missing-ETag error uses Error::client directly so it classifies as Kind::Client.
- exhaustion_error_without_source assertion updated to walk past the wrapping azure_core::Error.
- Dataflow/planner/topology asserts switched from to_string() (which prefixes [Kind] and appends (status)) to message().
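The source-chain walk mentioned for session_token_from_error can be sketched with the standard library alone. `RawResponseError`, `Wrapped`, and `find_in_chain` are hypothetical names for illustration; the driver's actual types and helper are not shown on this page.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical inner error that still carries the raw response bytes.
#[derive(Debug)]
pub struct RawResponseError {
    pub raw_response: Vec<u8>,
}

impl fmt::Display for RawResponseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "raw response of {} bytes", self.raw_response.len())
    }
}

impl Error for RawResponseError {}

/// Hypothetical outer error minted from another error.
#[derive(Debug)]
pub struct Wrapped {
    context: String,
    source: Box<dyn Error + Send + Sync + 'static>,
}

impl Wrapped {
    pub fn new(context: &str, source: impl Error + Send + Sync + 'static) -> Self {
        Self { context: context.to_string(), source: Box::new(source) }
    }
}

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.context)
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        // Drop the Send + Sync bounds to match the trait's return type.
        let inner: &(dyn Error + 'static) = &*self.source;
        Some(inner)
    }
}

/// Walks `err` and its `source()` chain, returning the first link of type `T`.
pub fn find_in_chain<'a, T: Error + 'static>(err: &'a (dyn Error + 'static)) -> Option<&'a T> {
    let mut current: Option<&'a (dyn Error + 'static)> = Some(err);
    while let Some(link) = current {
        if let Some(hit) = link.downcast_ref::<T>() {
            return Some(hit);
        }
        current = link.source();
    }
    None
}
```

The walk checks the outermost error first and then each `source()` in turn, so it finds the payload whether the error was built directly or wrapped one or more times.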
…e-chain downcast walks
Azure Pipelines successfully started running 1 pipeline(s).
analogrelay
approved these changes
May 28, 2026
Upstream PR Azure#4477 (Public API Cleanup Pass) renamed several public types (CosmosAccountEndpoint->AccountEndpoint, CosmosAccountReference->AccountReference, FeedItemIterator->QueryItemIterator, FeedPageIterator->QueryPageIterator, IncrValue->CosmosNumber, PatchOp->PatchOperation, PatchSpec->PatchInstructions, with_master_key->with_authentication_key) and removed the ConnectionString public re-export, while this branch refactored fallible APIs to return crate::Result<T, CosmosError>. Resolution preserves both: upstream renames + structural cleanups (options normalization, non_exhaustive markers, find_offer/begin_replace operation_options param) layered over the local CosmosError-based Result type and DriverCosmosError bridge. Dropped CosmosStatus from the lib.rs error re-export (still accessible via models::CosmosStatus from the driver) to avoid a duplicate definition with the driver's pub use.
Member
Author
/azp run rust - cosmos - weekly
Azure Pipelines successfully started running 1 pipeline(s).
heaths
reviewed
May 28, 2026
heaths
approved these changes
May 28, 2026
tvaron3
approved these changes
May 28, 2026
analogrelay
approved these changes
May 28, 2026
analogrelay
approved these changes
May 28, 2026
simorenoh added a commit that referenced this pull request on May 29, 2026
Remove dead cosmos_headers and body params from try_handle_retry_trigger_group — only needed by the !safe_to_retry abort path removed in the idempotency gate removal. Remove stale status field from Abort variant after CosmosError refactor (#4442). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Replaces the SDK's reliance on `azure_core::Error` for Cosmos failure reporting with a typed, diagnosable `Error` that's safe to construct at high rates in production.

Every error returned from the driver or SDK now carries, without any downcasts or string parsing, the typed `CosmosStatus` (HTTP status + sub-status + categorical `Kind`), the parsed `CosmosResponseHeaders` (RU charge, activity id, session token, LSNs, …), the raw service response body, the shared `DiagnosticsContext` for the operation, and (where the production-safety gates allow) a captured stack backtrace. The previous `azure_core::Error` remains reachable via `std::error::Error::source()`.

Motivation

`azure_core::Error` exposes failure data behind an opaque enum and string messages, which forced callers into brittle pattern matches like `e.kind() == HttpResponse { status, .. }` and silently dropped Cosmos-specific fields (sub-status, RU charge, activity id, diagnostics) at the boundary. That made production triage of Cosmos failures (throttles, session retries, transport classifications, end-to-end timeouts) much harder than it needs to be.

Java/.NET SDKs have shown that handing back rich error objects is a major usability win, but Java/.NET also illustrate the cost of getting this wrong: when stack-frame computation runs unbounded inside exception construction, a transient backend error storm can become a sustained client-side CPU pinhole. This PR adds the rich error surface and the production-safety machinery needed to keep that surface affordable under load.
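As a rough illustration of an error that carries typed fields while keeping the original failure reachable through `source()`, consider the sketch below. `TypedError`, `TransportError`, the two-field `ResponseHeaders`, and `throttled_charge` are all assumptions made for this example; they are not the types added by this PR.

```rust
use std::error::Error as StdError;
use std::fmt;

/// Hypothetical lower-level error that gets wrapped.
#[derive(Debug)]
pub struct TransportError(pub String);

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl StdError for TransportError {}

/// Hypothetical typed headers; the field set is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct ResponseHeaders {
    pub request_charge: f64,
    pub activity_id: String,
}

/// Hypothetical typed error: status, headers and body travel with the error.
#[derive(Debug)]
pub struct TypedError {
    status_code: u16,
    sub_status: u32,
    headers: Option<ResponseHeaders>,
    response_body: Option<Vec<u8>>,
    source: Option<Box<dyn StdError + Send + Sync + 'static>>,
}

impl TypedError {
    pub fn from_response(
        status_code: u16,
        sub_status: u32,
        headers: ResponseHeaders,
        body: &[u8],
        source: TransportError,
    ) -> Self {
        Self {
            status_code,
            sub_status,
            headers: Some(headers),
            response_body: Some(body.to_vec()),
            source: Some(Box::new(source)),
        }
    }

    pub fn status_code(&self) -> u16 {
        self.status_code
    }

    pub fn sub_status(&self) -> u32 {
        self.sub_status
    }

    pub fn headers(&self) -> Option<&ResponseHeaders> {
        self.headers.as_ref()
    }

    pub fn response_body(&self) -> Option<&[u8]> {
        self.response_body.as_deref()
    }
}

impl fmt::Display for TypedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}", self.status_code, self.sub_status)
    }
}

impl StdError for TypedError {
    fn source(&self) -> Option<&(dyn StdError + 'static)> {
        match &self.source {
            Some(inner) => {
                let e: &(dyn StdError + 'static) = &**inner;
                Some(e)
            }
            None => None,
        }
    }
}

/// Reads the RU charge off a throttled (429) error without parsing strings.
pub fn throttled_charge(err: &TypedError) -> Option<f64> {
    if err.status_code() == 429 {
        err.headers().map(|h| h.request_charge)
    } else {
        None
    }
}
```

The point of the shape is that triage code such as `throttled_charge` reads fields directly, and the wrapped failure stays available through the standard `source()` chain.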
What changed
New typed `Error`

- `azure_data_cosmos_driver::error::Error` (the driver type) is the canonical type. It is a single `Arc<ErrorInner>`, so `Result<T, Error>` stays pointer-sized and `Clone` is a refcount bump.
- `azure_data_cosmos::Error` (the SDK type) is a `#[repr(transparent)]` re-export of the driver type, plus the crate-wide `azure_data_cosmos::Result<T>` alias.
- Fallible APIs (…, `into_model`/`single`/`items`, `FromStr` impls on `CosmosAccountEndpoint`/`ConnectionString`/`FeedRange`) now return `azure_data_cosmos::Result<T>` instead of `azure_core::Result<T>`.
- Typed accessors: `status() -> CosmosStatus`, `status_code()`, `sub_status()`, `kind()`, `cosmos_headers() -> Option<ResponseHeaders>`, `diagnostics() -> Option<&Arc<DiagnosticsContext>>`, `response_body() -> Option<&[u8]>`, `backtrace() -> Option<&str>`, plus the usual `is_*` predicates on `CosmosStatus`.
- `CosmosStatus` constants for well-known status/sub-status pairs (e.g. `CosmosStatus::TRANSPORT_GENERATED_503`, `READ_SESSION_NOT_AVAILABLE`, `RU_BUDGET_EXCEEDED`, `CROSS_PARTITION_QUERY_NOT_SERVABLE`), so call sites read as `assert_eq!(err.status(), CosmosStatus::CROSS_PARTITION_QUERY_NOT_SERVABLE)`.

Boundary mapper
`From<azure_core::Error>` classifies into the most specific `CosmosStatus` available: `HttpResponse` wins on its real wire status; otherwise the `azure_core::ErrorKind` plus a downcast walk of `.source()` (reqwest/hyper/h2/io) refines into synthetic sub-statuses (e.g. `TRANSPORT_DNS_FAILED`, `TRANSPORT_HTTP2_INCOMPATIBLE`, `AUTHENTICATION_TOKEN_ACQUISITION_FAILED`). The original `azure_core::Error` is preserved in the source chain.

The driver transport layer carries typed `Error` end-to-end; nothing wraps a Cosmos error back into an `azure_core::Error`, so the typed payload is never lost on the wire.

Stack backtrace capture: production-safe by construction
Two-tier cost model with two independent rolling-1-second limiters:
| Tier | Limit | Configuration (environment variable / runtime builder) |
|---|---|---|
| `Backtrace::capture()`: IP-only, microseconds | 1000 / s | `AZURE_COSMOS_BACKTRACE_CAPTURES_PER_SECOND` / `with_max_error_backtrace_captures_per_second` |
| `backtrace()` read: cache-missed frames only | 5 / s | `AZURE_COSMOS_BACKTRACE_RESOLUTIONS_PER_SECOND` / `with_max_error_backtrace_resolutions_per_second` |

Additional safety:
- … `None`.
- The rendered backtrace is cached per `Error` instance, so logging + telemetry + panic-message paths see the same answer.
- Wrapping an existing `Error` (e.g. transport-layer re-wrap) inherits the inner backtrace instead of capturing fresh, doubling the effective budget on retry-heavy paths.

`Display`/`Debug` impls

- `{e}`: bare message (matches `anyhow`/`azure_core`/`std::io` convention).
- `{e:#}`: header (`[Kind] status/sub (name)`) + source chain (`Display`) + diagnostics block + backtrace.
- `{e:?}`: header + source chain (`Debug`) + diagnostics. No backtrace, to keep `tracing::error!(?e)` cheap.
- `{e:#?}`: full report including backtrace; the alternate flag cascades to source entries (`{src:#?}`) and to the `DiagnosticsContext` (`{diag:#?}`), so wrapped errors and diagnostics surface their pretty multi-line debug layout.

Misc
- `CosmosStatus::Display`/`Debug` now prefix the categorical `[Kind]` (e.g. `[Service] 429/3200 (RUBudgetExceeded)`). The `Deserialize` impl tolerates the `[Kind]` prefix for JSON round-trip stability.
- `Error::with_context(prefix)` for enriching mapper-classified errors with operation-specific context (single-allocation prepend, all typed fields preserved).
- Removed `unsafe` from the SDK `ResponseHeaders` wrapper: `Error::cosmos_headers()` now returns an owned `Option<ResponseHeaders>` via a cheap clone (cold path) instead of the previous `repr(transparent)` reference transmute.

Breaking changes (SDK)
- Fallible public APIs now return `azure_data_cosmos::Result<T>`. Callers matching on `azure_core::ErrorKind` should switch to the typed accessors (`e.status_code()`, `e.sub_status()`, `e.status() == CosmosStatus::…`, `e.cosmos_headers()`, `e.diagnostics()`). The underlying `azure_core::Error` remains reachable via `std::error::Error::source()`.
- `Error::cosmos_headers()` returns `Option<ResponseHeaders>` (by value, cloned) instead of `Option<&ResponseHeaders>`.
- `CosmosStatus::Display` output now includes the `[Kind]` prefix; any diagnostics consumers that parsed the previous bare `"429/3200 (…)"` shape need to either accept the new format or use the typed accessors.

Testing
- `cargo test -p azure_data_cosmos_driver --lib --all-features`: 1670+ tests green (added coverage for backtrace limiter behavior, source-chain inheritance, `Display`/`Debug` format variants and alternate-flag cascade, named-constant comparisons, sub-status serialization with and without well-known names, JSON snapshot updates for the new status format).
- … `azure_data_cosmos`, `azure_data_cosmos_driver`, `azure_data_cosmos_perf`, `azure_data_cosmos_benchmarks`.
- … `Result<()>` alias.

Notes for reviewers
- … `azure_data_cosmos_driver` has a "when to adjust which" section.
- If you spot a path that wraps a Cosmos error back into an `azure_core::Error`, please flag it: that's a regression that would lose the typed payload.
- `Error::client`/`Error::serialization`/`Error::configuration` are `#[doc(hidden)] pub` so the SDK wrapper crate can construct typed errors; they are not part of the public surface.

Backtrace machinery benchmarks
`cargo bench -p azure_data_cosmos_benchmarks --bench backtrace_capture`

Reviewed against the production-readiness changes (opt-in capture, two-limiter model, source-error backtrace inheritance, `OnceLock` per-instance render cache).

Changes to the bench harness
- Added `capture/cosmos/inherit_from_source` to cover the re-wrap path on `CosmosErrorBuilder::with_arc_source(cosmos_err)`. This path skips a fresh stack walk and inherits the source's `Backtrace`, which is a key production optimization but was previously unmeasured.
- … `capture/cosmos/throttle_denied` to make explicit that it also represents the default production state (capture is opt-in; with `RUST_BACKTRACE` unset the same fast-denial path runs on every construction).
- Removed an extra `prime_resolution_cache()` call (left over from when the limiter capacity was 1); the unbounded limiter only needs one prime pass.

Results (Windows, release, 100 samples per group)
| Benchmark | Notes |
|---|---|
| `capture/cosmos/unbounded` | … |
| `capture/cosmos/throttle_denied` | `AtomicU64` CAS denial. Also the default-off production cost when `RUST_BACKTRACE` is unset. ~750× cheaper than `unbounded`, ~915× cheaper than `std::backtrace::Backtrace::force_capture`. |
| `capture/cosmos/inherit_from_source` | `CosmosErrorBuilder` build (alloc + status + message + `Arc<source>` clone) that inherits the source's backtrace, i.e. does not re-capture. Effectively equal to `std::force_capture` alone, so the entire builder overhead disappears into the cost of one stack walk that we are deliberately skipping. |
| `capture/std/force_capture` | `std::backtrace::Backtrace::force_capture()` baseline. |
| `render/cosmos/cached` | `OnceLock` hit on the per-instance render cache: the steady-state cost of every `CosmosError::backtrace()` call after the first one. ~735× cheaper than `std::backtrace::to_string`. |
| `render/cosmos/fresh_warm_cache` | Fresh `Backtrace` per iter, process-global IP-keyed frame cache hot: pays cache lookup only, no symbol resolution, no budget consumption. |
| `render/cosmos/fresh_cold_resolution_denied` | Fresh `Backtrace` per iter with the resolution limiter exhausted: proves the denial fast-path is cheaper than even a fully cached resolution. Validates the "no partial backtraces" guarantee. |
| `render/std/to_string` | `std::backtrace::Backtrace::to_string()` baseline. std has no per-instance cache; every call re-walks debug info. |

Conclusions
- … `RUST_BACKTRACE` is unset (a single CAS), versus 1.36 µs for the captured path. Default-off error storms cost the same as an atomic decrement.
- Re-wrapping a `CosmosError` adds the cost of the builder plumbing only (no second stack walk), so the pipeline's re-wrap sites (transport → service, status promotion, etc.) do not multiply backtrace cost across nested errors.
- Every `CosmosError::backtrace()` call after the first returns in ~21 ns, versus 15 µs for `std::backtrace`: a structural difference of ~700×, which matters on any path that formats the same error multiple times (e.g. `tracing::error!("{e}")` + `Result::unwrap` panic message).

All benches and the production code build clean (`-D warnings`, all features).
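For readers who want a feel for the two mechanisms the benchmarks exercise, the sketch below pairs a per-second budget held in one `AtomicU64` with a per-instance `OnceLock` render cache. It is not the driver's implementation: `PerSecondLimiter` and `CapturedTrace` are hypothetical names, the limiter uses a simple fixed one-second window rather than whatever rolling scheme the driver uses, the caller supplies the clock, and caching a denied render as `None` is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Hypothetical per-second budget. The upper 32 bits of `state` hold the
/// window (whole seconds), the lower 32 bits the permits used in that window.
pub struct PerSecondLimiter {
    max_per_second: u32,
    state: AtomicU64,
}

impl PerSecondLimiter {
    pub const fn new(max_per_second: u32) -> Self {
        Self { max_per_second, state: AtomicU64::new(0) }
    }

    /// Tries to take one permit for the second `now_secs`.
    pub fn try_acquire(&self, now_secs: u32) -> bool {
        let mut current = self.state.load(Ordering::Relaxed);
        loop {
            let window = (current >> 32) as u32;
            // A new second starts with an empty count.
            let used = if window == now_secs { current as u32 } else { 0 };
            if used >= self.max_per_second {
                return false; // budget exhausted: deny without writing
            }
            let next = ((now_secs as u64) << 32) | (used as u64 + 1);
            match self.state.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(observed) => current = observed, // lost a race: retry
            }
        }
    }
}

/// Hypothetical captured trace: raw instruction pointers plus a render cache.
pub struct CapturedTrace {
    frames: Vec<usize>,
    rendered: OnceLock<Option<String>>,
}

impl CapturedTrace {
    pub fn new(frames: Vec<usize>) -> Self {
        Self { frames, rendered: OnceLock::new() }
    }

    /// Renders at most once per instance. A denied render is cached as
    /// `None`, so every later caller sees the same answer.
    pub fn render(&self, limiter: &PerSecondLimiter, now_secs: u32) -> Option<&str> {
        self.rendered
            .get_or_init(|| {
                if limiter.try_acquire(now_secs) {
                    let lines: Vec<String> =
                        self.frames.iter().map(|ip| format!("{ip:#x}")).collect();
                    Some(lines.join("\n"))
                } else {
                    None
                }
            })
            .as_deref()
    }
}
```

Under these assumptions the expensive step (here just hex formatting, standing in for symbol resolution) runs at most once per trace and only while the budget allows; repeated `render` calls reduce to a `OnceLock` read.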